Abstract

Topmouth culter (Culter alburnus) is an ecologically and economically important 
species belonging to the subfamily Culterinae that is native to and widespread in East 
Asia. Intraspecific variation of semi-buoyant and adhesive eggs in topmouth culter 
provides an ideal opportunity to investigate the genetic mechanisms of spawning habits 
underlying the adaptive radiation of cyprinids in East Asia. In this study, we present a 
chromosome-level genome assembly of topmouth culter and re-sequencing data for 158
individuals from six locations in China covering three geographical groups and two egg
type variations. The assembled genome was 1.05 Gb in size, with a contig N50
length of 17.8 Mb, and was anchored onto 24 chromosomes. Phylogenetic analysis showed
that the divergence time of the Culterinae coincided with the initiation of
the Asian monsoon intensification. Gene family evolutionary analysis indicated that the 
expanded gene families in topmouth culter were associated with dietary adaptation. 
Population-level genetic analysis indicated clear differentiation among the six 
populations, which were clustered into three distinct clusters, consistent with their 
geographical divergence. The historical effective population size of topmouth culter 
correlated with the Tibetan Plateau uplifting according to the demographic history 
reconstruction. A selective sweep analysis between adhesive and semi-buoyant egg 
populations revealed the genes associated with the hydration and adhesiveness of eggs, 
indicating divergent selection toward different hydrological environments. The present 
study offers a high-resolution genetic resource for further studies on evolutionary
adaptation, genetic breeding, and conservation of topmouth culter, providing insights
into the molecular mechanisms for egg type variation of East Asian cyprinids.


KEYWORDS 


Culter alburnus, population genomics, genetic diversity, egg type variation 


1 Introduction 

Speciation and ecological adaptations in endemic species are important concepts in 
evolutionary biology. Cyprinidae (Teleostei: Cypriniformes) is a species-rich family of 
freshwater fish comprising approximately 3,000 species and 367 genera (Nelson et al., 
2016). This diverse group of cypriniformes is distributed widely in Europe, Asia, Africa, 
and North America. The rapid burst of speciation in cyprinid fishes in East Asia has 
been suggested to be related to the Qinghai-Tibet Plateau (QTP) uplifting and the Asian 
monsoon climate formation, which resulted in a cross-linked river-lake system,
leading to remarkable ecological and phenotypic diversification in morphologies, 
feeding habits, life histories, and reproductive strategies (Chen, 1998; Feng et al., 2022; 
He et al., 2004). The endemic East Asian cyprinids derived from a single clade of 154 
species (Nelson et al., 2016; Tan & Armbruster, 2018; Wang et al., 2007) represent a 
useful model for investigating rapid speciation and adaptive radiations. However,
limited genomic resources are available for this evolutionarily important group, and the
genomic variation underlying adaptive radiation and speciation has been
reported in only a few species (Jian et al., 2021; Wang et al., 2015; Xu et al., 2019).


Topmouth culter (Culter alburnus) is a Culterinae fish species belonging to the
family Cyprinidae; in China, it is known as the white fish and has high economic value
(Ren et al., 2019). It is a widespread species distributed across major drainages in China,
except the Tibetan Plateau, inhabiting divergent habitats such as rivers, reservoirs, and
lakes (Qi et al., 2015; Sun et al., 2021). Interestingly, the topmouth culter has two
ecotypes that differ in spawning habits: in lakes such as Liangzi Lake and Taihu Lake
in the Yangtze River basin, it lays adhesive eggs that stick to aquatic plants or rocks,
whereas in other water bodies (including both rivers and lakes), it spawns semi-buoyant
(floating) eggs that float in fast-flowing waters (Chen et al., 2022; Sun et al., 2021).
Evolutionary reconstruction of the endemic East Asian cyprinids has revealed an 
ancestral state of spawning adhesive eggs and subsequent independent parallel evolution
of floating eggs, in which some species of cultrins and xenocyprinins evolved a 
transition from floating eggs to adhesive eggs again, such as that found in the topmouth 
culter (Chen et al., 2021; Chen et al., 2023b; Cheng et al., 2022). Notably, the 
differentiation of adhesive and semi-buoyant eggs has been suggested as an adaptation 
for lentic and lotic habitats, respectively (Chen et al., 2021; Chen et al., 2023a). This 
adaptation is closely related to the cross-linked river-lake system shaped by the East
Asian monsoon climate during the middle Miocene, which drove the adaptive radiation 
of the endemic East Asian cyprinids (Chen et al., 2022; Chen, 1998; Cheng et al., 2022). 
Therefore, the intraspecific variation of egg types in topmouth culter is conducive to 
the study of genetic mechanisms of ecological adaptation of spawning habits in the East 
Asian cyprinids. Studies of the egg types of topmouth culter have been limited to
embryonic development; a recent study by Chen et al. (2022) revealed several key
pathways associated with egg hydration and adhesiveness in the embryonic
development of floating and adhesive eggs of topmouth culter through transcriptomic
and proteomic analyses. However, the genomic variations in
different geographic populations and ecotypes remain to be elucidated.

In recent decades, topmouth culter has become an important aquaculture species 
in China owing to its delicious taste and high economic value. Hybrid lineages of 
topmouth culter with Megalobrama amblycephala (Ren et al., 2019) and
Ancherythroculter nigrocauda (Zhang et al., 2020) exhibiting desirable traits have been 
developed. However, overfishing, water pollution, and habitat fragmentation or loss 
have been threatening the natural populations of topmouth culter (Qi et al., 2015). 
Previous studies based on putatively neutral markers such as mitochondrial DNA and 
microsatellites have reported the genetic structure of the geographically isolated
populations of the species (Qi et al., 2015; Sun et al., 2021; Xiong et al., 2019). All 
these studies indicate that wild topmouth culter resources must be protected to prevent 
further decline of its populations. Genetic analysis based on the mitochondrial DNA
control region suggests that a population in the Pearl River basin is distinct from those 
in the Yangtze River and Heilongjiang River basins, potentially existing as a cryptic 
subspecies (Xiong et al., 2019). A study employing the mitochondrial D-loop region 
and microsatellites reported that two semi-buoyant egg spawning populations in the 
Xingkai Lake in the Amur River (Heilongjiang River) basin and the Danjiangkou Reservoir
in the Yangtze River basin were genetically distant from the other four studied populations
(Sun et al., 2021). However, traditional genetic methods are less efficient in revealing
the fine-scale genetic structure, evolutionary history, genomic signature, and key
adaptive loci related to local adaptation and egg type variations of topmouth culter. 
Thus, highly efficient population genomic approaches that can provide more 
information on genetic parameters, adaptation, and diversification are required. 
High-quality genomic data are essential for investigating the genomic variations, 
genetic diversity, and demographic history of topmouth culter. The first genome 
assembly of topmouth culter was constructed using short-read Illumina and long-read 
PacBio sequencing, covering 1.02 Gb in 5,742 scaffolds, with a contig N50 length of 
72.24 kb (Ren et al., 2019). However, a chromosomal-level genome assembly of 
topmouth culter is not yet available. In the present study, an improved chromosome- 
scale genome assembly of topmouth culter was obtained by a combination of Illumina, 
PacBio, and Hi-C scaffolding techniques. In addition, a population genetic analysis was 
performed to investigate the population structure and gain insights into the intraspecific 
variation of egg types in topmouth culter. The high-quality chromosome-level genome 
assembly can serve as a valuable genetic resource not only for molecular breeding and 
conservation of topmouth culter but also for further investigations of ecological
speciation in evolutionary radiation of endemic East Asian cyprinids.


2 MATERIALS AND METHODS 


2.1 Ethics statement, sample collection, and genome sequencing 


All the procedures were conducted following the Animal Care and Ethics Regulations
of the Animal Experiment Committee (DK2021030), Northwest A&F University
(Yangling, China). For genome sequencing, a wild female topmouth culter individual
captured from Xingkai Lake (45.2685 °N, 132.7964 °E) in Heilongjiang Province,
China, in September 2021 was used. Genomic DNA was extracted from the muscle 
tissues by using the DNeasy Mini Kit (Qiagen) according to the manufacturer’s 
instructions. To assist the genome annotation, total RNAs of seven tissues (heart, liver, 
brain, spleen, kidney, muscle, and ovary) were extracted using an EZNA HP Total RNA 
Kit (Omega Bio-tek) and pooled together for cDNA library construction. Paired-end 
libraries with an insert size of 300 bp were constructed and sequenced on the MGISEQ- 
T7, a new sequencer from MGISEQ platform launched by the Beijing Genomics 
Institute (BGI) Tech based on DNA nanoball technology. The MGISEQ platform 
promises to deliver high-quality sequencing data faster at lower prices and its 
performance has been demonstrated to be comparable with the Illumina platform in 
various studies, including whole-genome, whole-exome, transcriptome, single-cell 
transcriptome and metagenome sequencing (Zhu et al., 2021). For PacBio library 
construction and sequencing, high-quality DNA was subjected to size selection by using 
the BluePippin system, and ~20-kb SMRTbell libraries were prepared and run on the
PacBio Sequel II CLR platform (Pacific Biosciences). Hi-C libraries were prepared and
sequenced on the MGISEQ-T7 platform for the chromosome-level genome assembly. 
A total of 158 individuals of topmouth culter were collected. These individuals 
belonged to six populations, namely Xingkai Lake (XKL; n = 30) in the Amur River
basin (Heilongjiang basin); Danjiangkou Reservoir (DJKR; n = 38), Yuanshui River
(YSR; n = 22), Liangzi Lake (LZL; n = 17), and Taihu Lake (TL; n = 22) in the Yangtze
River basin; and Hanjiang River (HJR; n = 29) in the Pearl River basin. Among these
populations, LZL and TL populations lay adhesive eggs, whereas the other four
populations (DJKR, YSR, XKL and HJR) spawn semi-buoyant eggs. The caudal fin
tissues used for genomic DNA extraction were preserved in 100% ethanol at 4 °C.


2.2 Genome survey and de novo assembly 


Filtered MGISEQ-T7 sequencing data were used to estimate the topmouth culter 
genome size by using a 17-mer depth frequency distribution analysis. A total of
48,598,039,979 k-mers with the expected depth value of 46.01 were obtained (Figure
S1). The genome size was calculated as the ratio of the k-mer number to the k-mer depth.
Adapter and low-quality regions of PacBio long reads were removed to obtain subreads.
NextDenovo v. 2.3 (Chin et al., 2016) was run with different parameter sets to produce
multiple assembly versions, and the assembly generated with the default parameters was
chosen. For genome correction, the PacBio reads and MGISEQ-T7 reads were aligned
to the assembled genome and NextPolish v. 1.3.1 (Hu et al., 2020) was employed to 
polish the initial genome. For chromosome-level assembly, Bowtie2 v. 2.2.5 
(Langmead & Salzberg, 2012) was used to align the clean Hi-C data to the assembled 
contigs, which were further used to construct the inter/intra chromosomal contact map 
by using HiC-Pro v. 2.11 (Servant et al., 2015). The valid interaction pairs were further
ordered, oriented, and anchored to the 24 pseudochromosomes with Lachesis (Burton 
et al., 2013) by using an agglomerative hierarchical clustering method. 
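
The k-mer-based size estimate described at the start of this section reduces to a single division. The following is a minimal sketch of that ratio only, assuming the k-mer total and depth reported above; the function name is hypothetical, and the sketch does not model heterozygous or repeat-derived peaks that dedicated genome survey tools account for.

```python
def estimate_genome_size(kmer_count: int, kmer_depth: float) -> float:
    """Return the estimated genome size in bp as total k-mers / expected k-mer depth."""
    if kmer_depth <= 0:
        raise ValueError("k-mer depth must be positive")
    return kmer_count / kmer_depth


# Values taken from the 17-mer analysis described in the text.
size_bp = estimate_genome_size(48_598_039_979, 46.01)
print(f"Estimated genome size: {size_bp / 1e9:.2f} Gb")
```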

To evaluate the quality of the topmouth culter genome, the MGISEQ-T7 short
reads and RNA-seq data were aligned to the assembly, and the mapping ratio and
coverage were assessed. Finally, the genome completeness was evaluated by BUSCO
v. 5.2.0 using the Actinopterygii gene set (Simao et al., 2015).


2.3 Genome annotation 


Simple sequence repeats (SSRs) in the topmouth culter genome were identified by 
MISA (Thiel et al., 2003) using default parameters. For other repetitive sequences, the 
Extensive de novo TE Annotator pipeline was used to identify transposable elements 
(Ou et al., 2019). 

Protein-coding genes were predicted using homology-based, de novo, and RNA-seq-based
strategies. For RNA-seq-based prediction, transcriptome data of liver, heart,
kidney, muscle and brain tissues were aligned to the topmouth culter genome and then 
used for gene structure prediction by using PASA v. 2.0.2 (Haas et al., 2003). For de 
novo prediction, the high-quality data set generated using PASA was utilized to train 
ab initio gene predictors including Augustus, SNAP, GlimmerHMM, and Geneid. For
homology-based prediction, protein sequences from seven species were mapped to the 
topmouth culter genome and then homologous genes were predicted using GeMoMa 
(Keilwagen et al., 2016). Finally, all gene models were integrated using 
EVidenceModeler (EVM) (Haas et al., 2008) to generate a nonredundant gene set, and 
TransposonPSI (Yagi et al., 2014) was used to remove the genes with transposons.
Functional annotation of the translated amino acid sequences of the final gene sets was
conducted by alignment to the known databases including Non-Redundant Protein
Sequence Database (NR), Gene Ontology (GO), InterPro and Kyoto Encyclopedia of
Genes and Genomes (KEGG) by using BlastP with an E-value threshold of 1e-05.


2.4 Gene family and phylogenomic analysis


To determine gene family evolution in the topmouth culter genome, orthologous and 
paralogous gene families were clustered using the gene models from 11 species, namely 
Oryzias latipes, Triplophysa tibetana, Triplophysa dalaica, Danio rerio, 
Ancherythroculter nigrocauda, Megalobrama amblycephala, Ctenopharyngodon idella, 
Hypophthalmichthys nobilis, Paracanthobrama guichenoti, Onychostoma macrolepis, 
and C. alburnus, by OrthoMCL (Li et al., 2003). Gene family expansion or contraction 
was estimated by comparing the cluster size between the ancestor and each species by 
using CAFE (De Bie et al., 2006). GO and KEGG enrichment analyses were performed
for expanded and contracted gene families by using Fisher’s exact test. The
single-copy orthologous genes were used for the phylogenetic analysis and divergence time
estimations. Multiple sequence alignment was conducted using MAFFT v. 7.429 
(Katoh & Standley, 2013), and poorly aligned regions were filtered and removed using 
Gblocks v. 0.91b (Castresana, 2000). Phylogenetic relationships were inferred using 
RAxML v. 1.5 (Silvestro & Michalak, 2012), with medaka as the outgroup species. The 
MCMCtree program implemented in the PAML software package (Yang, 2007) was 
used to estimate the divergence times. Seven calibration time points retrieved from the
TimeTree database were applied in the current study (Kumar et al., 2017).


2.5 Single nucleotide polymorphism calling and filtering


Genomic DNA of 158 individuals collected from six geographical populations was used
to construct libraries with an average insert size of 300 bp and then sequenced using
the MGISEQ-T7 platform. Raw reads containing adaptor sequences, poly-N (>10%),
and low-quality bases (Phred quality value <15) were removed. High-quality clean data
were mapped to the topmouth culter genome by using BWA v. 0.7.17 (mem -M -t 20 -k 32).
SAMtools v. 1.9 (Li et al., 2009) was used to filter duplicate and unmapped reads,
sort the reads, and convert them into the BAM format. Single nucleotide 
polymorphisms (SNPs) and insertions and deletions (InDels) were identified using the 
HaplotypeCaller module in GATK v. 4 (McKenna et al., 2010). Low-confidence raw
variants were filtered using VariantFiltration in GATK with the
parameters “--filterExpression Quality (QUAL) < 30.0, QualByDepth (QD) < 2.0,
FisherStrand (FS) > 60.0, RMSMappingQuality (MQ) < 40.0, StrandOddsRatio (SOR) >
4.0, MappingQualityRankSumTest (MQRankSum) < -12.5, and ReadPosRankSum
(RPRankSum) < -8.0”. The genomic variants for population analysis were further
filtered using VCFTOOLS v. 0.1.13 (Danecek et al., 2011) with the parameters
“--min-alleles 2 --max-alleles 2 --min-meanDP 5 --maf 0.05 --max-missing 0.5”. Finally, SNPs
and InDels were annotated to their corresponding chromosomal locations.
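
The hard-filtering thresholds listed in this section can be read as a set of independent per-variant tests. The sketch below illustrates that logic on plain Python dictionaries; it is not the GATK implementation, the function name and example records are hypothetical, and skipping missing annotations is an assumption made here for simplicity.

```python
import operator

# (annotation, comparison, threshold): a variant fails a filter when the comparison holds.
HARD_FILTERS = [
    ("QUAL", operator.lt, 30.0),
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 60.0),
    ("MQ", operator.lt, 40.0),
    ("SOR", operator.gt, 4.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
]


def failed_filters(annotations: dict) -> list:
    """Return the names of the hard filters that a variant fails.

    Annotations absent from the record are skipped (an assumption of this sketch).
    """
    failed = []
    for name, compare, threshold in HARD_FILTERS:
        value = annotations.get(name)
        if value is not None and compare(value, threshold):
            failed.append(name)
    return failed


# Hypothetical variant records for illustration only.
good = {"QUAL": 812.4, "QD": 21.3, "FS": 1.2, "MQ": 60.0, "SOR": 0.7,
        "MQRankSum": 0.1, "ReadPosRankSum": -0.4}
poor = {"QUAL": 45.0, "QD": 1.1, "FS": 75.0, "MQ": 58.0, "SOR": 2.0}

print(failed_filters(good))  # no filters fail
print(failed_filters(poor))  # fails the QD and FS thresholds
```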


2.6 Population genetic analysis 


Based on the genome-wide SNPs and InDels, Plink v. 1.9 (Chang et al., 2015) was used 
to remove SNPs in high linkage disequilibrium (--indep-pairwise
100kb 1 0.8). Principal component analysis (PCA) was performed using EIGENSOFT 
v. 6.14 (Patterson et al., 2006). Population structure was further inferred by using 
ADMIXTURE v. 1.3.0 (Alexander & Lange, 2011) without prior population
information (K ranging from 2 to 10), and 10-fold cross-validation was performed to
determine the most probable number of ancestral populations. A neighbor-joining (NJ) tree was generated
through IQ-TREE v. 2 (Minh et al., 2020) using the ultrafast bootstrap approach with
1000 replicates. To infer the historical changes in effective population sizes in response
to climatic change, we selected one individual from each population with the highest 
sequencing depth and employed the pairwise sequentially Markovian coalescent 
(PSMC) (Li & Durbin, 2011) method with a mutation rate (μ) of 4 x 10’ and an
estimated time of 3 years per generation. The uplift process of the Tibetan Plateau and 
the time range of three phases of intense uplift (Qingzang, Kunhuang and Gonghe 
Movement) were predicted based on previous studies (An et al., 2001; Li & Fang, 1999). 
Genetic diversity indexes, including observed heterozygosity (Ho), expected
heterozygosity (He), and nucleotide diversity (π), were estimated using the populations program in
Stacks, and the population genetic differentiation index (Fst) was calculated using
VCFTOOLS to estimate both global and pairwise divergence among populations.
Linkage disequilibrium (LD) analysis for each population was conducted on the basis
of the coefficient of determination (r²) between two given SNPs by using
POPLDDECAY (https://github.com/BGI-shenzhen/PopLDdecay).
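
As a reading aid for the diversity indexes named above, the sketch below computes Ho, He, and an unbiased per-site π for a single biallelic SNP using textbook definitions. It is an illustration under stated assumptions, not the Stacks implementation: the function name and genotype vector are hypothetical, genotypes are coded as alternate-allele counts in diploids, and genome-wide values would additionally average over all sites.

```python
def site_diversity(genotypes):
    """Return (Ho, He, pi) for one biallelic SNP.

    genotypes: alternate-allele counts (0, 1, or 2) for diploid individuals.
    He is 2pq; pi applies the n/(n-1) sample-size correction to He.
    """
    n_ind = len(genotypes)
    if n_ind == 0:
        raise ValueError("at least one genotype is required")
    n_alleles = 2 * n_ind
    p = sum(genotypes) / n_alleles                    # alternate-allele frequency
    ho = sum(1 for g in genotypes if g == 1) / n_ind  # fraction of heterozygotes
    he = 2 * p * (1 - p)                              # expected heterozygosity
    pi = he * n_alleles / (n_alleles - 1)             # unbiased per-site diversity
    return ho, he, pi


# Hypothetical genotypes for ten individuals at one SNP.
ho, he, pi = site_diversity([0, 1, 1, 2, 0, 1, 0, 0, 1, 2])
print(f"Ho={ho:.3f} He={he:.3f} pi={pi:.3f}")
```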


2.7 Genomic variation analysis 


To identify loci potentially influencing intraspecific variation in topmouth
culter egg types, we conducted a genomic selective sweep analysis between two
groups: adhesive egg populations (LZL and TL) and floating egg
populations (DJKR, HJR, XKL, and YSR). The nucleotide diversity (π) ratio and
divergence index (Fst) were estimated using VCFTOOLS with a 200-kb sliding
window in 20-kb steps. Dxy, an absolute measure of genetic divergence between two
populations, was also calculated using genomics_general
(https://github.com/simonhmartin/genomics_general) with a 200-kb sliding window in
20-kb steps. Windows that simultaneously fell within the top 5% of values for the π ratio,
Fst, and Dxy were defined as strong selective sweep regions. In addition, we
independently estimated Fst between each adhesive egg population (LZL and TL)
and each floating egg population (DJKR, HJR, XKL, and YSR) to identify the
divergent regions between LZL and TL. Finally, the genes within or overlapping the
sweep regions were selected for subsequent GO and KEGG pathway enrichment
analyses.
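
The window scheme and the joint top-5% criterion described in this section can be sketched as follows. This is an illustration only: the function names and toy statistics are hypothetical, per-window values are assumed to have been computed already, and the direction of the π ratio (which group is in the numerator) is left to the caller.

```python
def sliding_windows(chrom_length, window=200_000, step=20_000):
    """Yield 0-based half-open (start, end) windows; the last window ends at the chromosome end."""
    if chrom_length <= 0:
        return
    start = 0
    while True:
        end = min(start + window, chrom_length)
        yield start, end
        if end == chrom_length:
            break
        start += step


def top_fraction_threshold(values, fraction=0.05):
    """Return the smallest value that still falls inside the top `fraction` of `values`."""
    ranked = sorted(values, reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    return ranked[n_top - 1]


def candidate_windows(stats, fraction=0.05):
    """Return windows whose pi ratio, Fst, and Dxy are all in the top `fraction`.

    stats maps a window key to a (pi_ratio, fst, dxy) tuple.
    """
    keys = list(stats)
    thresholds = [
        top_fraction_threshold([stats[k][i] for k in keys], fraction)
        for i in range(3)
    ]
    return [k for k in keys if all(stats[k][i] >= thresholds[i] for i in range(3))]


# Toy example: 40 hypothetical windows in which all three statistics rise together.
toy_stats = {i: (float(i), float(i), float(i)) for i in range(40)}
print(list(sliding_windows(250_000)))
print(candidate_windows(toy_stats))
```

Requiring all three statistics to exceed their own thresholds is stricter than using any single statistic, which is why the resulting candidate set is typically much smaller than 5% of the windows.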


3 RESULTS AND DISCUSSION


3.1 Chromosome-scale genome assembly of topmouth culter 


For de novo assembly of the topmouth culter, we integrated data from MGISEQ-T7, 
PacBio sequencing, and Hi-C platforms, as illustrated in Table S1. After quality control, 
a total of 132.91 Gb (~100× depth) of clean reads produced from the MGISEQ-T7
platform were used for genome size estimation. The 17-mer analysis showed a genome
size of 1.2 Gb with a heterozygosity of 0.45% (Table S2). A total of 220.72 Gb (~200×
depth) of PacBio sequencing data with a mean length of 22,404 bp were used for
assembling the topmouth culter genome.


PacBio clean reads were used to assemble the genome, and finally, a 1.05 Gb
reference genome comprising 262 contigs (> 1 kb) with an N50 length of 17.8 Mb was
obtained; this assembly is more contiguous than the previously published topmouth
culter draft genome (Tables 1 and S3), which had 34,855 contigs covering 1.02 Gb,
with an N50 length of 72.24 kb (Ren et al., 2019). Moreover, the contig N50 length of
the topmouth culter genome assembly constructed in this study is higher than that of
the published genome assemblies of related species; for example, the M.
amblycephala genome had a contig N50 length of 2.40 Mb (Liu et al., 2021) and the A.
nigrocauda genome had a contig N50 length of 3.12 Mb (Zhang et al., 2020).
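
Because the comparisons above rest on contig N50, a minimal sketch of how that statistic is conventionally computed may be useful. The function name and the toy contig lengths are hypothetical and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of length >= L
    together cover at least half of the total assembly size."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths (arbitrary units) for illustration only.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

A higher N50 at a similar total assembly size indicates that the same sequence is contained in fewer, longer contigs.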


3.2 Genome annotation


Approximately 97.02% of the assembled sequences (1.02 Gb) were anchored onto 24 
chromosomes by using 115.56 Gb of clean data generated from the Hi-C library (Figure 1a,
Table S4), consistent with the previously reported karyotype result (Wang et al., 2009). 
The GC content of the topmouth culter genome was approximately 37.5% (Figures 1b 
and S2), similar to those of the genomes of other cyprinids (Jian et al., 2021; Xu et al., 
2014; Zhang et al., 2020). Repetitive sequences accounted for approximately 49.79% of the genome
assembly, with DNA transposons (34.58%) and long
terminal repeat retrotransposons (8.11%) being the most abundant transposable
elements (Figure 1b, Table S5). In addition, we identified 760,249 SSRs, with
mononucleotide repeats being the most abundant (49.9%) (Tables S6-S7). Finally, a total of 26,208
protein-coding genes (Table S8) were identified in the topmouth culter genome by using 
a combination of de novo, homology-based, and RNA-seq-based strategies.
The predicted gene models showed distribution patterns similar to those of seven other
fish species in the number and length of CDS, exons, and introns (Figure S3,
Table S8). Approximately 99.26% of the genes were successfully annotated by 
alignment to the public database (Table S9). To evaluate the completeness of the 
topmouth culter genome assembly, we mapped the MGISEQ-T7 short reads to the 
assembled genome, which indicated a mapping rate of 99.48% (Table S10). Using 
BUSCO, the coverage of 3462 highly conserved single-copy Actinopterygii genes was 
found to be 95.1% and 94.7% for the assembled genome and gene set, respectively 
(Table S11). Moreover, a high collinearity was observed among the topmouth culter, 
grass carp and zebrafish genomes (Figures S4-S5). The aforementioned results
confirmed that the constructed chromosome-level genome assembly of topmouth culter
was of high quality.


3.3 Comparative genomic and evolutionary analyses 


To investigate the phylogenetic relationship among topmouth culter and other species 
and estimate their divergence times, we constructed a phylogenetic tree by using
single-copy orthologs (Figure 1c). The results indicated that topmouth culter and A.
nigrocauda diverged approximately 3.51 MYA after diverging from M.
amblycephala at approximately 5.38 MYA. The divergence times for the three
Culterinae species were much later than the previously estimated times of divergence
of topmouth culter (12.84 MYA) (Ren et al., 2019) and A. nigrocauda (8.79 MYA)
(Zhang et al., 2020) from M. amblycephala, which might be attributed to the higher
number of cyprinid species and the larger number of calibration points considered in the present study.
The divergence time between the three Culterinae species and C. idella was 14.02 MYA in the
middle Miocene, coinciding with the time of initiation of the Asian monsoon
intensification (Clift et al., 2008), supporting the hypothesis that the burst of 
diversification in the endemic East Asian cyprinids is related to monsoon activity (Chen, 
1998; Chen et al., 2023b; Feng et al., 2022; He et al., 2004). 

Gene family evolutionary analysis revealed that 519 gene families were expanded 
and 267 gene families were contracted in the topmouth culter genome when compared 
with its most recent common ancestor (Figure 1c). Functional enrichment analysis of 
the expanded gene families showed that they were significantly enriched in 24 GO 
terms and 32 KEGG pathways (Table S14), mainly related to proteolysis (GO:0006508, 
p.adjust = 5.73E-09), DNA integration (GO:0015074, p.adjust = 4.15E-16), motor 
activity (GO:0003774, p.adjust = 5.95E-07), myosin complex (GO:0016459, p.adjust 
= 5.95E-07), NOD-like receptor signaling pathway (map04621, p.adjust = 2.07E-41), 
parathyroid hormone synthesis, secretion and action (map04928, p.adjust = 7.03E-08), 
and protein digestion and absorption (map04974, p.adjust = 1.80E-06) (Figure S6). The 
expansion of these immune-, nutrition-, and locomotion-related gene families in topmouth culter
is consistent with its carnivorous habit, in contrast to the herbivorous and
phytoplanktivorous habits of other endemic East Asian cyprinids, indicating
that these genes may have a crucial role in species-specific adaptation. The contracted
gene families were mainly involved in nucleosome (GO:0000786, p.adjust = 2.55E-73),
necroptosis (map04217, p.adjust = 1.25E-09), sulfotransferase activity (GO:0008146, 
p.adjust = 0.007791), and glycosaminoglycan biosynthesis (map00534, p.adjust = 
1.81E-05) (Figure S7). The contraction of glycosaminoglycan biosynthesis genes may 
be related to the absence of the adhesive layer on the egg envelope of the floating egg.


3.4 Population structure analyses 


Understanding the population structure of topmouth culter is important for
conservation and genetic breeding studies. Therefore, we re-sequenced 158 topmouth
culter individuals from six populations, representing three geographical groups (Amur
River basin, Yangtze River basin, and Pearl River basin) and two egg type variations
(floating and adhesive egg) (Figure 2a). An average of 15.3 Gb (~14.84×) of 2 × 150 bp
paired-end data per individual was generated, with an average mapping rate and coverage of
99.26% and 96.69%, respectively (Table S15). After SNP calling and filtering, a total 
of 7,276,044 and 1,587,880 high-quality SNPs and InDels, respectively, were detected. 

Admixture analysis revealed three genetically distinct clusters that were strongly 
partitioned by geographic proximity (Figure 2c). XKL in the Amur River basin
and HJR in the Pearl River basin were successively separated from the populations
in the Yangtze River basin when the number of ancestry components (K) increased from 2 to 3. At
the best-supported K = 4 (Figure S8), LZL and YSR populations in the Yangtze River
basin clustered together, whereas DJKR and TL populations formed one cluster despite 
being located at a long distance. Considering that the DJKR, situated in the upstream 
reaches of a Yangtze River tributary, was constructed several decades ago (Sun et al., 
2021), the genetic similarity between DJKR and TL may be attributed to a shared 
ancestral polymorphism. Notably, the XKL population showed no admixture when the 
K value increased to 6, suggesting its greater genetic distance from the other
populations, consistent with the findings of previous studies (Sun et al., 2021; Xiong et
al., 2019). Interestingly, the two adhesive populations LZL and TL showed some degree
of admixture, indicating potential gene flow or parallel adaptive divergence. The NJ 
tree (Figure 2b) and PCA (Figure 2d) recapitulated these groupings. Additionally, the 
position of the outgroup species A. nigrocauda in the NJ tree suggested that A. 
nigrocauda was closer to the topmouth culter population in YSR, consistent with its 
sympatric distribution in the upper reaches of Yangtze River. 

The PSMC analysis revealed two rounds of population decline, which
corresponded well with the Tibetan Plateau uplifting events (Figure 2e). The
effective population size (Ne) of the six topmouth culter populations peaked at nearly 3.5
MYA and then declined sharply, coinciding with two intense uplift phases,
namely the Qingzang (3.6-1.7 MYA) and Kunhuang (1.1-0.6 MYA) movements in the third
tectonic Tibetan Plateau uplift phase. The second population decline occurred with the
beginning of the Gonghe Movement (~0.15 MYA) (An et al., 2001; Li & Fang, 1999). This
demographic pattern may be attributed to the remarkable changes in geology and
climate during Tibetan Plateau uplifting, which may have been unfavorable to the topmouth
culter population.


3.5 Genetic diversity and linkage disequilibrium 


To evaluate the degree of divergence among the three geographical groups of
topmouth culter from six locations, we first calculated the genetic diversity indexes
and the pairwise population differentiation coefficient Fst. The observed and expected
heterozygosity values were similar among populations, with Ho ranging from 0.27 to
0.30 and He ranging from 0.30 to 0.31 (Table S12). Nucleotide diversity (π) within
populations was highest in TL (1.94 × 10⁻³), followed by LZL (1.93 × 10⁻³),
YSR (1.92 × 10⁻³), DJKR (1.88 × 10⁻³) and HJR (1.78 × 10⁻³), and lowest
in XKL (1.53 × 10⁻³) (Table S12). This pattern aligns with the presence of
egg type variation in the Yangtze River basin. The comparison of Fst also showed
that the XKL population in the Amur River basin was more distant from the populations in
the Yangtze River basin than was the HJR population in the Pearl River basin, which was also
supported by the results of population structure (Figure 3a; Table S13). The LD decay
rates of the six populations varied markedly, with the highest LD level found in
XKL, followed by that in HJR, indicating a stronger bottleneck or founder effect
(Bray et al., 2010) (Figure 3b). Overall, these results suggest that the genetic
differentiation among the six topmouth culter populations primarily resulted from
geographic isolation, and that the XKL population, which has the lowest genetic diversity, requires
enhanced conservation efforts.


3.6 Selection signatures of egg type variation in topmouth culter 


Semi-buoyant eggs of topmouth culter undergo substantial hydration, whereas adhesive
eggs possess a unique adhesive layer on their envelope, which is responsible for specific
adaptations to spawning environments (Chen et al., 2022). The intraspecific variation
of egg type among topmouth culter populations suggests divergent selection, which
may be a strong evolutionary force driving population differentiation. Therefore, we
conducted selective sweep detection to find outlier SNPs or diverged regions between
the adhesive and floating egg populations.
Conjoint analysis of π ratios, Fst and Dxy (both top 5%) identified divergent
genomic regions containing 72 and 94 genes for the adhesive group (LZL and TL) and
floating group (DJKR, HJR, XKL, and YSR), respectively (Figure 4a; Table S16).
Specifically, GO and KEGG enrichment analysis revealed that a significant number of
genes were represented in the pathways of regulation of actin cytoskeleton and lipid
metabolism (Figure 4b; Figure S9), consistent with the processes of fertilization and egg
activation during topmouth culter embryogenesis (Chen et al., 2023). We found high
levels of heterogeneous genomic divergence between the populations with the two egg
phenotypes, scattered across the genome, and identified selection signals spanning a
set of candidate genes that overlapped with the key pathways of hydration and
adhesiveness in topmouth culter eggs (Chen et al., 2022). For example, zinc finger
protein (ZFP) in the zinc metalloproteinase pathway might play a role in yolk protein
degradation. Voltage-dependent calcium channel (CACN) might be involved in the
permeability transition pore of the egg envelope. Procollagen galactosyltransferase
(COLGALT), collagen alpha-4(VI) chain (COL6A4), COL6A6, fibronectin type III
domain-containing protein 5 (FNDC5) and integrin alpha-X (ITGAX), as crosslinks of
microfilament-associated proteins, might contribute to the adhesiveness of adhesive
eggs (Tang, 2020; Whittaker & Hynes, 2002). We also independently compared each
adhesive egg population against each floating egg population. The pairwise comparison
revealed similar genomic divergence (genomic islands) between the two adhesive
populations (Figure S10) and identified many overlapping genes that may indicate
local adaptation responses to different hydrological environments (Table S17).




Taken together, we believe that these candidate genes may be valuable for further
functional characterization.
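The outlier-window logic described in this section can be sketched in Python as
follows. This is a minimal illustration under stated assumptions rather than the
authors' implementation: the Window record, the nearest-rank quantile and the function
names are hypothetical, and the per-window π, Fst and Dxy values are assumed to have
been computed beforehand. Following the Figure 4 legend, the π ratio is taken as
floating/adhesive, so a low ratio points to a sweep in the floating group and a high
ratio to a sweep in the adhesive group.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Window:
    chrom: str
    start: int
    pi_floating: float
    pi_adhesive: float  # assumed non-zero so that the ratio is defined
    fst: float
    dxy: float


def nearest_rank(values: Sequence[float], q: float) -> float:
    """Empirical q-quantile (0 < q <= 1) using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def classify_windows(windows: List[Window], tail: float = 0.05) -> Dict[str, List[Window]]:
    """Keep windows in the top tail of Fst and Dxy and in either tail of the pi ratio."""
    ratios = [w.pi_floating / w.pi_adhesive for w in windows]
    ratio_low = nearest_rank(ratios, tail)
    ratio_high = nearest_rank(ratios, 1 - tail)
    fst_cut = nearest_rank([w.fst for w in windows], 1 - tail)
    dxy_cut = nearest_rank([w.dxy for w in windows], 1 - tail)

    sweeps: Dict[str, List[Window]] = {"floating": [], "adhesive": []}
    for w, ratio in zip(windows, ratios):
        # A window must be highly differentiated on both Fst and Dxy.
        if w.fst < fst_cut or w.dxy < dxy_cut:
            continue
        if ratio <= ratio_low:
            sweeps["floating"].append(w)   # reduced diversity in the floating group
        elif ratio >= ratio_high:
            sweeps["adhesive"].append(w)   # reduced diversity in the adhesive group
    return sweeps
```

In this sketch the cut-offs are derived from the empirical distributions themselves,
which corresponds to the 5% tails reported in the Figure 4 legend (π ratio of 0.89 and
1.06; Fst of 0.032). Genes would then be assigned by intersecting the retained windows
with the genome annotation.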


4 CONCLUSIONS 

The present study reports a chromosome-scale genome assembly for topmouth culter
and explores the genetic relationships among six topmouth culter populations and the
genomic variation between adhesive and semi-buoyant egg phenotypes, based on
whole-genome resequencing data from 158 individuals. The assembly is of high quality,
with a contig N50 length of 17.8 Mb and high completeness (BUSCO score, 96.7%).
Comparative genomic and evolutionary analyses revealed the divergence time and
genetic variation of topmouth culter relative to other endemic East Asian cyprinids. The
population-level genetic analysis revealed distinct geographical groups and markedly
reduced genetic diversity in the XKL population. The study also identified signatures
of selection associated with egg type variation between the adhesive and floating
populations. The genomic data obtained in the present study can serve as a valuable
resource for further studies on evolution, genetic breeding, and conservation of
topmouth culter and other endemic East Asian cyprinids.
Tables and Figures


Tables 


TABLE 1 Comparison of our genome assembly of topmouth culter with the previous study

Assembly                      This study                 Previous study (a)
Assembly approach             MGISEQ-T7, PacBio, Hi-C    Illumina, PacBio
Assembled genome size (Gb)    1.05
Contig number                 262
Contig length (bp)            1,053,229,386
Contig N50 (bp)               17,799,895
Contig N90 (bp)               2,878,033
Contig maximum (bp)           44,146,744
GC (%)                        37.50
Gap number                    125

(a) Previously published version (Ren et al., 2019).
Figure legends

Figure 1 Genome features and phylogenetic and evolutionary analyses. (a) Interaction
matrix across the topmouth culter genome; blocks with higher color intensity indicate
stronger contacts. (b) Circos graph (from outside to inside) representing the gene
density, all repeat sequence density, SNP (green) and InDel (blue) density, total genetic
diversity (π), and the GC content distribution across the chromosomes of the genome
with 1-Mb sliding window size. (c) Phylogenetic tree based on 2106 single-copy
orthologs and distribution of homologous genes of the 12 species. The numbers near
the ancestral nodes indicate the estimated divergence time (MYA, million years ago),
with the 95% confidence intervals in parentheses. Endemic East Asian cyprinid lineages
are indicated in red. Expansion and contraction of gene families are represented as
green and red numerical values, respectively. The stacked-column plot illustrates the
distribution of unique genes, single-/multiple-copy genes, other, and unclustered genes.

Figure 2 Population structure analysis and demographic history of different topmouth
culter geographical populations. (a) Geographic locations of the six topmouth culter
populations. Circles with different colors represent different geographic sites. (b)
Phylogenetic tree inferred from whole-genome SNPs. (c) Genetic structure of topmouth
culter with different ancestry kinships (K = 2 to 6). Each bar represents an individual,
and different colors represent the proportion contributed by that ancestral population.
Different geographical populations are indicated along the bottom X-axis with different
colors, as indicated in (a). (d) PCA plots of the first three components of the 158
topmouth culter individuals. (e) Demographic histories constructed using the PSMC
model. The time range of three rounds of intense uplift (Qingzang, Kunhuang and
Gonghe Movement) is shaded in gray. The black curve shows the Tibetan Plateau uplift
event, and the right Y-axis indicates the height above sea level.

Figure 3 Population diversity analysis. (a) Genetic divergence across the three basins
studied. The value in each circle represents nucleotide diversity (π) for that group, and
the pairwise genetic divergence (Fst) is indicated on the line linking two sites. (b)
Linkage disequilibrium distance analysis.

Figure 4 Identification of divergent regions in the floating and adhesive egg topmouth
culter populations. (a) Distribution of π ratios (floating/adhesive) and Fst values, which
were calculated in 200-kb windows sliding in 20-kb steps. Green points in the upper
left panel are the selective sweep regions for the floating populations, whereas red
points in the upper right panel are the selective sweep regions for the adhesive egg
populations. Vertical and horizontal dashed lines correspond to the 5% tails of the
empirical π ratio (0.89 and 1.06) and Fst (0.032) distributions, respectively. (b) Top 20
GO terms of the divergent genes between the floating and adhesive egg populations. (c)
Manhattan plot of the highly divergent genomic regions and overlapping selective
signals. Candidate genes associated with egg type variation are highlighted as red dots.
The dashed line indicates the threshold for selected regions (Fst = 0.032). Chr.,
chromosome.